YouTube Trending Videos Analysis

Importing required libraries:

Dataset Description

File type: csv

  1. video_id: Uniquely identifies each video
  2. published_at: Date and time the video was published
  3. categoryId: Id of the category the video belongs to
  4. trending_date: Date and time when the video got to Trending
  5. view_count: Number of views (cumulative)
  6. likes: Number of likes (cumulative)
  7. dislikes: Number of dislikes (cumulative)
  8. comment_count: Number of comments (cumulative)
  9. country: Country in which the video was trending
  10. description: Description of the video by the creator
  11. tags: Tags of the video by the creator
  12. title: Title of the video
  13. channelTitle: Channel title of the video
  14. thumbnail_link: Link to the video thumbnail
  15. comments_disabled: Boolean value indicating whether viewers can comment
  16. ratings_disabled: Boolean value indicating whether viewers can rate through likes and dislikes
  17. channelId: Uniquely identifies the channel the video comes from

File type: json

  1. id: Id of category the video belongs to
  2. name: Respective category names of category ids

Importing the individual .csv files for the USA, Great Britain, and Canada from the home directory:

Finding the shape of these 3 dataframes

Opening JSON file and loading the required data to match categoryId to its respective category name:

The dataset we downloaded from Kaggle has two types of files for each country: 'video.csv', which contains all the features described above, and 'category.json', which maps category ids to category names.

To merge this information together, we did the following steps:

1. Load the JSON file for each country

2. Since the JSON file is nested, use pandas' json_normalize function to flatten it and read it into a dataframe

3. Merge the videos dataframe with the category dataframe for each country using a left join
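The flatten-and-merge steps above can be sketched as follows; the nested layout mimics the YouTube category JSON, and the rows are made-up stand-ins for the real Kaggle files:

```python
import pandas as pd

# Hypothetical miniature of category.json (the real file follows the
# YouTube API layout: items -> {id, snippet: {title}}).
category_json = {
    "items": [
        {"id": "1", "snippet": {"title": "Film & Animation"}},
        {"id": "10", "snippet": {"title": "Music"}},
    ]
}

# Flatten the nested records into a flat dataframe.
categories = pd.json_normalize(category_json["items"])
categories = categories.rename(
    columns={"id": "categoryId", "snippet.title": "category_name"}
)
categories["categoryId"] = categories["categoryId"].astype(int)

# Toy stand-in for one country's video.csv.
videos = pd.DataFrame({"video_id": ["a1", "b2", "c3"],
                       "categoryId": [1, 10, 24]})

# Left join keeps every video row, even when its category id is unmapped.
merged = videos.merge(categories, on="categoryId", how="left")
```

A left join leaves a NaN category_name for unmapped ids (here categoryId 24), which is exactly the null-handling case addressed later in the cleaning section.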

Noting the shape again after the merge

Adding a 'country' column so country-specific information can still be identified after the three dataframes (USA, Great Britain, Canada) are appended

Appending Data from 3 countries:

Making a list of the per-country dataframes and using the pd.concat function to append them into one dataframe
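A minimal sketch of the tagging-and-appending step, on hypothetical per-country frames:

```python
import pandas as pd

# Hypothetical stand-ins for the three per-country dataframes.
us = pd.DataFrame({"video_id": ["a"], "view_count": [100]})
gb = pd.DataFrame({"video_id": ["b"], "view_count": [200]})
ca = pd.DataFrame({"video_id": ["c"], "view_count": [300]})

# Tag each frame with its country before appending.
for frame, name in [(us, "USA"), (gb, "Great Britain"), (ca, "Canada")]:
    frame["country"] = name

# Append all three into one dataframe with a fresh index.
df = pd.concat([us, gb, ca], ignore_index=True)
```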

Data Cleaning

Dropping columns that are irrelevant for analysis:

The output above confirms that the listed columns are dropped without disturbing any rows, so we use the inplace=True parameter with drop.

Data coherency

Checking whether the True values in the comments_disabled and ratings_disabled columns are consistent with the counts: comment_count should be 0 when comments_disabled is True, and likes and dislikes should be 0 when ratings_disabled is True.

The records with True values in comments_disabled and ratings_disabled have corresponding zero values (likes, dislikes, comment_count). Hence the data is coherent and it is safe to drop these two columns.
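The coherency check can be sketched like this (toy rows; column names follow the dataset description):

```python
import pandas as pd

# Toy rows following the dataset description.
df = pd.DataFrame({
    "likes": [10, 0, 5],
    "dislikes": [1, 0, 2],
    "comment_count": [3, 4, 0],
    "comments_disabled": [False, False, True],
    "ratings_disabled": [False, True, False],
})

# Disabled comments should imply zero comment_count; disabled ratings
# should imply zero likes and dislikes.
comments_ok = (df.loc[df["comments_disabled"], "comment_count"] == 0).all()
ratings_ok = (df.loc[df["ratings_disabled"], ["likes", "dislikes"]] == 0).all().all()

if comments_ok and ratings_ok:
    # The flags carry no extra information, so drop them.
    df.drop(columns=["comments_disabled", "ratings_disabled"], inplace=True)
```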

Checking for Null values

There are null values in category_name and description

1. Dealing with Category_name null values

2. Dealing with description null values

1. Dealing with null values in category_name

For categoryId = 29, the USA file has the category name 'Nonprofits & Activism', whereas CA and GB did not define a category name. Looking at the video titles of those records, we can conclude that they belong to the Nonprofits & Activism category, so we replace the NaNs with 'Nonprofits & Activism'.

2. Dealing with Null values in description

Filling missing values in description with an empty string ('').
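Both null-handling steps can be sketched as follows (the rows are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: category 29 lacks a name in the CA/GB files.
df = pd.DataFrame({
    "categoryId": [29, 29, 10],
    "category_name": ["Nonprofits & Activism", np.nan, "Music"],
    "description": [np.nan, "a charity stream", "a song"],
})

# Replace missing names for categoryId 29 with the name the USA file uses.
mask = df["categoryId"] == 29
df.loc[mask, "category_name"] = df.loc[mask, "category_name"].fillna(
    "Nonprofits & Activism"
)

# Missing descriptions become empty strings.
df["description"] = df["description"].fillna("")
```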

Understanding data in descriptive columns

The function defined below checks whether the text contains non-English characters.

Function to remove non-English characters from the text:

Validating the function 'removeNonEnglishWords'
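One plausible implementation of these helpers, assuming "English" is approximated as ASCII-only text (the notebook's actual definition may differ):

```python
import re

def containsNonEnglish(text):
    # True if any character falls outside the ASCII range.
    return bool(re.search(r"[^\x00-\x7F]", text))

def removeNonEnglishWords(text):
    # Keep only the words made entirely of ASCII characters.
    return " ".join(w for w in text.split() if not containsNonEnglish(w))
```

For example, a mixed title like "BTS 방탄소년단 comeback" would be reduced to "BTS comeback".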

Creating a flag to check if each record in description is in English

False: indicates the presence of non-English characters in the description

True: indicates all the words in the description are English

Adding a new feature 'des' (the cleaned description) to the dataframe

From the above output, the des column contains only English words, so it is safe to drop the description and isEnglish (flag) columns.

Now applying the same logic to the tags, title, and channel_title columns to keep only English words

Tags cleaning:
adding a new feature 'c_tags' (the cleaned tags) to the dataframe

Now the c_tags column contains only the English words from the tags column, so it is safe to drop the isEnglish and tags columns from df

Title (video title)

The above code cleans the title column and stores the result in c_title, so we drop the isEnglish and title columns from df

Channel_Title

Dropping the isEnglish and channelTitle columns and keeping the cleaned channel_title

Converting Datatypes for Analysis

Converting country to the category dtype

Rename the columns for easier understanding and uniformity
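A sketch of the dtype conversions and renaming; the specific columns shown here are an assumed subset of the real conversions:

```python
import pandas as pd

# Hypothetical subset of the columns being converted/renamed.
df = pd.DataFrame({
    "published_at": ["2020-11-01T10:00:00Z"],
    "country": ["USA"],
    "title": ["example video"],
})

# Timestamps to datetime; low-cardinality country to the
# memory-friendly category dtype.
df["published_at"] = pd.to_datetime(df["published_at"])
df["country"] = df["country"].astype("category")

# Uniform snake_case names for readability.
df = df.rename(columns={"title": "video_title"})
```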

Data Correctness

We observed a few records with likes or dislikes greater than view_count, which is practically impossible.

Dropping records with likes > view_count.
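The correctness filter can be written as (toy rows):

```python
import pandas as pd

# Toy rows: record "b" is impossible because likes exceed view_count.
df = pd.DataFrame({
    "video_id": ["a", "b", "c"],
    "view_count": [100, 50, 10],
    "likes": [90, 60, 5],
})

# Keep only the physically possible records, then reset the index.
df = df[df["likes"] <= df["view_count"]].reset_index(drop=True)
```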

Resetting the index after dropping those rows

Writing this cleaned file to compressed csv

-------------------------------------Verified clean data for analysis---------------------------------------------------------

Now the Dataset looks clean for some exploration

Resetting the index again, since we set video_id as the index when writing the file to csv

Cleaned Dataset Description

  1. video_id: Uniquely identifies each video
  2. published_at: Date and time the video was published
  3. category_id: Id of the category the video belongs to
  4. trending_date: Date and time when the video got to Trending
  5. view_count: Number of views (cumulative)
  6. likes: Number of likes (cumulative)
  7. dislikes: Number of dislikes (cumulative)
  8. comment_count: Number of comments (cumulative)
  9. category_name: Name of the category corresponding to the id
  10. country: Country in which the video was trending
  11. description: Description of the video by the creator
  12. tags: Tags of the video by the creator
  13. video_title: Title of the video
  14. channel_title: Channel title of the video

FINDING 1:

GOAL: To plot a time series of the number of video views per trending date in the News and Politics category during the pandemic (2020-2021), using a Plotly time series plot

We analyzed various categories but found the most interesting insights in the 'News and Politics' category, so we chose it for further analysis

FINDING: There are 2 unusual spikes with significantly more views: one in November 2020 and another between May and July 2021

NOVEMBER 2020

To generate the wordcloud we used the wordcloud library. To restrict the data to the time range, we applied a date mask from October 1 to December 1; we then combined the tags and titles of all videos in that window into a single string and passed it to the wordcloud generator, which draws the most frequent words in a larger size.
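The filtering-and-concatenation step might look like this; the dates, tags, and titles below are made up, and the resulting string is what would be handed to WordCloud().generate from the wordcloud package:

```python
import pandas as pd

# Hypothetical slice of the News & Politics subset.
df = pd.DataFrame({
    "trending_date": pd.to_datetime(["2020-10-15", "2020-11-05", "2021-02-01"]),
    "c_tags": ["election vote", "results live", "stimulus"],
    "c_title": ["Election Night", "Results Stream", "Relief Bill"],
})

# Date mask: October 1 to December 1, 2020.
mask = (df["trending_date"] >= "2020-10-01") & (df["trending_date"] <= "2020-12-01")
spike = df[mask]

# One big string of tags and titles for the wordcloud generator.
text = " ".join(spike["c_tags"]) + " " + " ".join(spike["c_title"])
```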

MAY-JULY 2021

Summary: It can be inferred from the above wordcloud that the May-July 2021 spike was caused by two major events:

1. Hamas attacks on Israel

2. Deadly Covid-19 second wave in India

-------------------------------------------------------------------------------------------------------------------

FINDING 2:

To find the popular categories by country

USA

Great Britain

Do Americans prefer certain tags and titles over the British?

Finding: Yes. In the USA, people have varied interests across both categories, whereas in Great Britain the most-watched videos of 2020-21 are predominantly about football, and more specifically Manchester United.

INSIGHT:

--------------------------------------------------------------------------------------------------------------

FINDING 3: (MACHINE LEARNING)

Keeping only one unique record of each video_id

Dropping all columns except title, description, and tags, which will help us predict the category

Stopword Removal and Word Stemming

Defining a function to remove stop words and perform word stemming (which reduces a word to its root form)
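A self-contained sketch of the idea; the notebook most likely uses NLTK's stopword corpus and PorterStemmer, so the tiny stopword set and naive suffix stripper here are stand-ins:

```python
# Tiny hand-rolled stopword set (stand-in for NLTK's corpus).
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def naive_stem(word):
    # Crude root-form reduction: strip a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Lowercase, drop stop words, stem what remains.
    words = [w.lower() for w in text.split()]
    return " ".join(naive_stem(w) for w in words if w not in STOPWORDS)
```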

Validating the preprocessing method:

Classification for Prediction

Doing prediction for categories with higher value counts to keep the dataset balanced

df_filtered: the dataframe restricted to the categories 'Entertainment', 'Sports', 'Music', 'Gaming', 'People & Blogs', 'Comedy', and 'News & Politics'

Bag of Words Model: a representation of text that describes the occurrence of words within a document. We use it to convert text into a numerical representation that the classification model can consume.
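A hand-rolled illustration of bag-of-words counting on a made-up mini corpus; in practice sklearn's CountVectorizer does the same bookkeeping at scale:

```python
from collections import Counter

# Made-up mini corpus.
docs = ["funny cat video", "cat music video", "news update"]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocab = sorted({w for doc in docs for w in doc.split()})

def bow_vector(doc):
    # One count per vocabulary word, zero when the word is absent.
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

vectors = [bow_vector(d) for d in docs]
```

Each document becomes a fixed-length count vector over the shared vocabulary, which is exactly the numerical representation the classifier trains on.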

Confusion Matrix

Using the classification report, which gives metrics like precision, recall, and F1 score for all the categories under consideration

Since we are predicting multiple classes, a standard (binary) AUC score cannot be computed directly in our scenario

TFIDF Concept: another feature representation of textual data. In simple terms, it assigns a weight to every word depending on how unique that word is to a given document: the weight is proportional to the word's frequency within its own document and inversely proportional to its occurrence across the other documents.
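The same weighting written out by hand, using the common tf × log(N/df) form (sklearn's TfidfVectorizer adds smoothing and normalization on top of this):

```python
import math
from collections import Counter

# Made-up mini corpus.
docs = ["cat video", "cat music", "news update"]
tokenized = [d.split() for d in docs]
N = len(docs)

# Document frequency: in how many documents each word appears.
df_counts = Counter(w for doc in tokenized for w in set(doc))

def tfidf(doc):
    tf = Counter(doc)
    # Weight grows with in-document frequency and shrinks as the
    # word spreads across more documents.
    return {w: tf[w] * math.log(N / df_counts[w]) for w in tf}

weights = [tfidf(doc) for doc in tokenized]
```

Here "cat" appears in two of the three documents, so it gets a lower weight than "video", which is unique to its document.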

Confusion Matrix

Using the classification report, which gives metrics like precision, recall, and F1 score for all the categories under consideration

Managerial Insight:

--------------------------------------------------------------------------------------------------------------

Finding 4

Preparing the dataset for machine learning

Checking data for a specific video_id

Now the dataset is ready for Machine learning

Make X,Y

Build and Train

Visualize the tree

Regression for Prediction

Splitting X and Y for training and test

Train and predict with LinearRegression

Finding the coefficients

Coefficient of determination (R²)

Metrics

Mean Absolute Error and Mean Absolute Deviation

MAD

Mean Squared Error (MSE)
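The three error metrics written out by hand on hypothetical residuals (sklearn.metrics provides equivalents for MAE and MSE):

```python
# Hypothetical true values and predictions.
y_true = [3.0, 5.0, 8.0]
y_pred = [2.0, 5.0, 10.0]

errors = [t - p for t, p in zip(y_true, y_pred)]

# Mean Absolute Error: average magnitude of the residuals.
mae = sum(abs(e) for e in errors) / len(errors)

# Mean Absolute Deviation: average distance of the residuals from
# their mean (some texts use MAD as a synonym for MAE).
mean_err = sum(errors) / len(errors)
mad = sum(abs(e - mean_err) for e in errors) / len(errors)

# Mean Squared Error: squaring penalizes large residuals more heavily.
mse = sum(e * e for e in errors) / len(errors)
```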

Calculating best regressor for the dataset

The goal is to find the predictor that minimizes the cross-validated MAD
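A toy version of that selection loop, comparing two trivial baseline predictors by cross-validated MAD on made-up data; the notebook compares real regressors the same way:

```python
from statistics import mean, median

# Made-up target values with one outlier.
y = [1.0, 2.0, 2.0, 3.0, 10.0, 2.0]

def cv_mad(predict, y, k=3):
    # k-fold cross-validation: fit on the training folds, score the
    # held-out fold by mean absolute deviation, average across folds.
    fold = len(y) // k
    scores = []
    for i in range(k):
        test = y[i * fold:(i + 1) * fold]
        train = y[:i * fold] + y[(i + 1) * fold:]
        pred = predict(train)
        scores.append(mean(abs(v - pred) for v in test))
    return mean(scores)

# Pick whichever candidate predictor minimizes the cross-validated MAD.
candidates = {"mean": mean, "median": median}
best = min(candidates, key=lambda name: cv_mad(candidates[name], y))
```

With the outlier at 10.0, the median predictor wins: it is less distorted by extreme values, which is exactly why MAD-style selection can favor robust models.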